arXiv:1502.05110v2 [cs.CR] 29 May 2015 


CDStore: Toward Reliable, Secure, and Cost-Efficient Cloud Storage via Convergent Dispersal

Mingqiang Li*, Chuan Qin, and Patrick P. C. Lee
Department of Computer Science and Engineering, The Chinese University of Hong Kong 
mingqiangli.cn@gmail.com, {cqin,pclee}@cse.cuhk.edu.hk 


Abstract 

We present CDStore, which disperses users’ backup data
across multiple clouds and provides a unified multi-cloud
storage solution with reliability, security, and
cost-efficiency guarantees. CDStore builds on an augmented
secret sharing scheme called convergent dispersal, which
supports deduplication by using deterministic
content-derived hashes as inputs to secret sharing. We present
the design of CDStore, and in particular, describe how
it combines convergent dispersal with two-stage deduplication
to achieve both bandwidth and storage savings and
be robust against side-channel attacks. We evaluate the
performance of our CDStore prototype using real-world
workloads on LAN and commercial cloud testbeds. Our
cost analysis also demonstrates that CDStore achieves a
monetary cost saving of 70% over a baseline cloud storage
solution using state-of-the-art secret sharing.

1 Introduction 

Cloud storage provides cost-efficient means for organizations
to host backups off-site [40]. However, from
users’ perspectives, putting all data in one cloud raises
reliability concerns regarding the single point of failure
and vendor lock-in, especially when cloud
storage providers can spontaneously terminate their business
[35]. Cloud storage also raises security concerns,
since data management is now outsourced to third parties.
Users often want their outsourced data to be protected
with guarantees of confidentiality (i.e., data is kept
secret from unauthorized parties) and integrity (i.e., data
is uncorrupted).

Multi-cloud storage coalesces multiple public cloud
storage services into a single storage pool, and provides
a plausible way to realize both reliability and security
in outsourced storage. It disperses data with some form
of redundancy across multiple clouds, operated by independent
vendors, such that the stored data can be recovered
from a subset of clouds even if the remaining clouds
are unavailable. Redundancy can be realized through
erasure coding (e.g., Reed-Solomon codes) or secret
sharing (e.g., Shamir’s scheme [54]). Recent multi-cloud
storage systems (e.g., [33]) leverage
erasure coding to tolerate cloud failures, but do not address
security; DepSky uses secret sharing to further
achieve both reliability and security. Secret sharing often
comes with high redundancy, yet its variants are shown
to reduce the redundancy of secret sharing to be slightly
higher than that of erasure coding, while achieving security
in the computational sense (see §2). Secret sharing
has a side benefit of providing keyless security (i.e., eliminating
encryption keys), which builds on the difficulty
for an attacker to compromise multiple cloud services
rather than a secret key. This removes the key management
overhead as found in key-based encryption [56].


However, existing secret sharing algorithms prohibit
storage savings achieved by deduplication. Since backup
data carries substantial identical content [58], organizations
often use deduplication to save storage costs, by
keeping only one physical data copy and having it shared
by other copies with identical content. On the other hand,
secret sharing uses random pieces as inputs when generating
dispersed data. Users embed different random
pieces, making the dispersed data different even if the
original data is identical.


This paper presents a new multi-cloud storage system
called CDStore, which makes the first attempt to provide
a unified cloud storage solution with reliability, security,
and cost efficiency guarantees. CDStore builds on
our prior proposal of an enhanced secret sharing scheme
called convergent dispersal, whose core idea is to
replace the random inputs of traditional secret sharing
with deterministic cryptographic hashes derived from the
original data, while the hashes cannot be inferred by attackers
without knowing the whole original data. This
allows deduplication, while preserving the reliability and
keyless security features of secret sharing. Using convergent
dispersal, CDStore offsets dispersal-level redundancy
due to secret sharing by removing content-level
redundancy via deduplication, and hence achieves cost
efficiency. To summarize, we extend our prior work
and make three new contributions.


First, we propose a new instantiation of convergent
dispersal called CAONT-RS, which builds on AONT-RS
[52]. CAONT-RS maintains the properties of AONT-RS,
and makes two enhancements: (i) using OAEP-based
AONT to improve performance and (ii) replacing
random inputs with deterministic hashes to allow deduplication.
Our evaluation also shows that CAONT-RS
generates dispersed data faster than our prior
AONT-RS-based instantiation.

Second, we present the design and implementation of
CDStore. It adopts two-stage deduplication, which first
deduplicates data of the same user on the client side to
save upload bandwidth, and then deduplicates data of
different users on the server side to further save storage.
Two-stage deduplication works seamlessly with convergent
dispersal, achieves bandwidth and storage savings,
and is robust against side-channel attacks. We
also carefully implement CDStore to mitigate computation
and I/O bottlenecks.

Finally, we thoroughly evaluate our CDStore prototype
using both microbenchmarks and trace-driven experiments.
We use real-world backup and virtual image
workloads, and conduct evaluation on both LAN
and commercial cloud testbeds. We show that CAONT-RS
encoding achieves around 180MB/s with only two-thread
parallelization. We also identify the bottlenecks
when CDStore is deployed in a networked environment.
Furthermore, we show via cost analysis that CDStore can
achieve a monetary cost saving of 70% via deduplication
over AONT-RS-based cloud storage.

2 Secret Sharing Algorithms 

We conduct a study of the state-of-the-art secret sharing
algorithms. A secret sharing algorithm operates by
transforming a data input called secret into a set of coded
outputs called shares, with the primary goal of providing
both fault tolerance and confidentiality guarantees for the
secret. Formally, a secret sharing algorithm is defined
based on three parameters (n, k, r): an (n, k, r) secret
sharing algorithm (where n > k > r ≥ 0) disperses a
secret into n shares such that (i) the secret can be reconstructed
from any k shares, and (ii) the secret cannot be
inferred (even partially) from any r shares.

The parameters (n, k, r) define the protection strength
of a secret sharing algorithm. Specifically, n and k determine
the fault tolerance degree of a secret, such that
the secret remains available as long as any k out of n
shares are accessible. In other words, it can tolerate the
loss of n − k shares. The parameter r determines the
confidentiality degree of a secret, such that the secret remains
confidential as long as no more than r shares are
compromised by an attacker. On the other hand, a secret
sharing algorithm makes the trade-off of incurring additional
storage. We define the storage blowup as the ratio
of the total size of n shares to the size of the original secret.
Note that the storage blowup must be at least n/k, as
the secret is recoverable from any k out of n shares.

Table 1: Comparison of secret sharing algorithms.

Several secret sharing algorithms have been proposed
in the literature. Table 1 compares them in terms of the
confidentiality degree and the storage blowup, subject to
the same n and k. Two extremes of secret sharing algorithms
are Shamir’s secret sharing scheme (SSSS) [54]
and Rabin’s information dispersal algorithm (IDA) [50].
SSSS achieves the highest confidentiality degree (i.e.,
r = k − 1), but its storage blowup is n (same as replication).
IDA has the lowest storage blowup (n/k), but its
confidentiality degree is the weakest (i.e., r = 0), and
any share can reveal the information of the secret. Ramp
secret sharing scheme (RSSS) generalizes both IDA
and SSSS to make a trade-off between the confidentiality
degree and the storage blowup. It evenly divides a
secret into k − r pieces, and generates r additional random
pieces of the same size. It then transforms the k
pieces into n shares using IDA.
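
As a rough worked example of the trade-off described above, the following Python sketch computes the storage blowup of SSSS, IDA, and RSSS for given n, k, and r. The function name storage_blowup is ours. The RSSS value n/(k − r) is derived here from the description above (k − r secret pieces plus r random pieces of the same size, dispersed by IDA) and is an assumption of this sketch, not a value quoted from Table 1.

```python
from fractions import Fraction


def storage_blowup(scheme: str, n: int, k: int, r: int = 0) -> Fraction:
    """Total size of the n shares divided by the size of the secret."""
    if scheme == "SSSS":
        # Every share is as large as the secret itself.
        return Fraction(n)
    if scheme == "IDA":
        # Every share is 1/k of the secret.
        return Fraction(n, k)
    if scheme == "RSSS":
        # The secret is cut into k - r pieces, so each of the n shares
        # has 1/(k - r) of the secret size.
        return Fraction(n, k - r)
    raise ValueError(f"unknown scheme: {scheme}")


n, k = 4, 3
for scheme, r in [("SSSS", k - 1), ("IDA", 0), ("RSSS", 1)]:
    print(scheme, "r =", r, "blowup =", storage_blowup(scheme, n, k, r))
```

In this sketch, RSSS with r = 0 coincides with IDA and RSSS with r = k − 1 coincides with SSSS, which matches the statement that RSSS generalizes both.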

Secret sharing made short (SSMS) combines IDA
and SSSS using traditional key-based encryption. It first
encrypts the secret with a random key and then disperses
the encrypted secret and the key using IDA and SSSS,
respectively. Its storage blowup is slightly higher than
that of IDA, while it has the highest confidentiality degree
r = k − 1 as in SSSS. Note that the confidentiality
degree is defined in the computational sense, that is, it is
computationally infeasible to break the encryption algorithm
without knowing the key.

AONT-RS [52] further reduces the storage blowup of
SSMS, while preserving the highest confidentiality degree
r = k − 1 (in the computational sense). It combines
Rivest’s all-or-nothing transform (AONT) for confidentiality
and Reed-Solomon coding for fault
tolerance. It first transforms the secret into an AONT
package with a random key, such that an attacker cannot
infer anything about the AONT package unless the
whole package is obtained. Specifically, it splits a secret
into a number s > 1 of words, and adds an extra canary
word for integrity checking. It masks each of the
s words by XOR’ing it with an index value encrypted
by a random key. The s masked words are placed at the
start of an AONT package. One more word, obtained
by XOR’ing the same random key with the hash of the
masked words, is added to the end of the AONT package.
The final AONT package is then divided into k equal-size
shares, which are encoded into n shares using a systematic
Reed-Solomon code (a systematic code means that
the n shares include the original k shares).

Figure 1: CDStore architecture.

The security of existing secret sharing algorithms lies 
in the embedded random inputs (e.g., a random key in 
AONT-RS). Due to randomness, secrets with identical 
content lead to distinct sets of shares, thereby prohibiting 
deduplication. This motivates CDStore, which enables 
secret sharing with deduplication. 

3 CDStore Design 

CDStore is designed for an organization to outsource the
storage of data of a large group of users to multiple cloud
vendors. It builds on the client-server architecture, as
shown in Figure 1. Each user of the same organization
runs the CDStore client to store and access its data in
multiple clouds over the Internet. In each cloud, a co-locating
virtual machine (VM) instance owned by the
organization runs the CDStore server between multiple
CDStore clients and the cloud storage backend.

CDStore targets backup workloads. We consider a
type of backups obtained by snapshotting some applications,
file systems, or virtual disk images. Backups generally
have significant identical content, and this makes
deduplication useful. Field measurements on backup
workloads show that deduplication can reduce the storage
overhead by 10× on average, and up to 50× in some
cases [58]. In CDStore deployment, each user machine
submits a series of backup files (e.g., in UNIX tar format)
to the co-located CDStore client, which then processes
the backups and uploads them to all clouds.

3.1 Goals and Assumptions 

We state the design goals and assumptions of CDStore in
three aspects: reliability, security, and cost efficiency.

Reliability: CDStore tolerates failures of cloud storage
providers and even CDStore servers. Outsourced data is
accessible if a tolerable number of clouds (and their co-locating
CDStore servers) are operational. CDStore also
tolerates client-side failures by offloading metadata management
to the server side (see §4.3). In the presence
of cloud failures, CDStore reconstructs original secrets
and then rebuilds the lost shares as in Reed-Solomon
codes. We do not consider cost-efficient repair.


Security: CDStore exploits multi-cloud diversity to
ensure confidentiality and integrity of outsourced data
against outsider attacks, as long as a tolerable number
of clouds are uncompromised. Note that the confidentiality
guarantee requires that the secrets be drawn from
a very large message space, so that brute-force attacks
are infeasible. CDStore also uses two-stage deduplication
(see §3.3) to avoid insider side-channel attacks
launched by malicious users. Here, we do not
consider strong attack models, such as Byzantine faults
in cloud services. We also assume that the client-server
communication over the network is protected, so
that an attacker cannot infer the secrets by eavesdropping
on the transmitted shares.

Cost efficiency: CDStore uses deduplication to reduce
both bandwidth and storage costs. It also incurs limited
overhead in computation (e.g., VM usage) and storage
(e.g., metadata). We assume that there is no billing for
the communication between a co-locating VM and the
storage backend of the same cloud, based on today’s pricing
models of most cloud vendors [30].

3.2 Convergent Dispersal 

Convergent dispersal enables secret sharing with deduplication
by replacing the embedded random input with a
deterministic cryptographic hash derived from the secret.
Thus, two secrets with identical content must generate
identical shares, making deduplication possible. Also,
it is computationally infeasible to infer the hash without
knowing the whole secret. Our idea is inspired by
convergent encryption used in traditional key-based
encryption, in which the random key is replaced by the
cryptographic hash of the data to be encrypted. Figure 2
shows the main idea of how we augment a secret sharing
algorithm with convergent dispersal.

This paper proposes a new instantiation of convergent
dispersal called CAONT-RS, which inherits the reliability
and security properties of the original AONT-RS,
and makes two key modifications. First, to improve
performance, CAONT-RS replaces Rivest’s AONT
with another AONT based on optimal asymmetric encryption
padding (OAEP) [20]. The rationale is that
Rivest’s AONT performs multiple encryptions on small-size
words (see §2), while OAEP-based AONT performs
a single encryption on a large-size, constant-value block.
Also, OAEP-based AONT provably provides no worse
security than any AONT scheme. Second, CAONT-RS
replaces the random key in AONT with a deterministic
cryptographic hash derived from the secret. Thus,
it preserves content similarity in dispersed shares and allows
deduplication. Our prior work also proposes
instantiations for RSSS and AONT-RS (based on
Rivest’s AONT) [52]. Our new CAONT-RS shows faster
encoding performance than our prior AONT-RS-based
instantiation (see §5.3).

Figure 2: Idea of convergent dispersal.
Figure 3: Example of CAONT-RS with n = 4 and k = 3.


We now elaborate on the encoding and decoding of
CAONT-RS, both of which are performed by a CDStore
client. Figure 3 shows an example of CAONT-RS with
n = 4 and k = 3 (and hence r = k − 1 = 2).

Encoding: We first transform a given secret X into a
CAONT package. Specifically, we first generate a hash
key h, instead of a random key, derived from X using a
(optionally salted) hash function H (e.g., SHA-256):

$$h = H(X). \tag{1}$$

To achieve confidentiality, we transform (X, h) into
a CAONT package (Y, t) using OAEP-based AONT,
where Y and t are the head and tail parts of the CAONT
package and have the same size as X and h, respectively.
To elaborate, Y is generated by:

$$Y = X \oplus G(h), \tag{2}$$

where ⊕ is the XOR operator and G is a generator
function that takes h as input and constructs a mask block
with the same size as X. Here, we implement the generator
G as:

$$G(h) = E(h, C), \tag{3}$$

where C is a constant-value block with the same size as
X, and E is an encryption function (e.g., AES-256) that
encrypts C using h as the encryption key.

The tail part t is generated by:

$$t = h \oplus H(Y). \tag{4}$$

Finally, we divide the CAONT package into k equal-size
shares (we pad zeroes to the secret if necessary to
ensure that the CAONT package can be evenly divided).
We encode them into n shares using the systematic
Reed-Solomon codes [46, 47].

To enable deduplication, we ensure that the same share
is located in the same cloud. Since the number of clouds
for multi-cloud storage is usually small, we simply disperse
shares to all clouds. Suppose that CDStore spans
n clouds, which we label 0, 1, ..., n − 1. After encoding
each secret using convergent dispersal, we label the
n generated shares 0, 1, ..., n − 1 in the order of their
positions in the Reed-Solomon encoding result, such that
share i is to be stored on cloud i, where 0 ≤ i ≤ n − 1.
This ensures that the same cloud always receives the
same share from the secrets with identical content, either
generated by the same user or different users. This
also enables us to easily locate the shares during restore.
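
The share-to-cloud placement above can be illustrated with a minimal Python sketch for n = 4 and k = 3. As a loudly stated simplification, a single XOR parity share stands in for the systematic Reed-Solomon code, so the sketch only covers n = k + 1; the function names disperse and reconstruct are hypothetical and not taken from the CDStore prototype.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(package: bytes, k: int) -> list:
    """Split a package into k equal-size shares and append one parity share.

    Assumption: one XOR parity share (n = k + 1) replaces Reed-Solomon
    coding. The share at list position i is meant for cloud i.
    """
    if len(package) % k != 0:
        raise ValueError("package must be padded to a multiple of k bytes")
    size = len(package) // k
    shares = [package[i * size:(i + 1) * size] for i in range(k)]
    return shares + [reduce(xor_bytes, shares)]


def reconstruct(available: dict, k: int) -> bytes:
    """Rebuild the package from any k of the n = k + 1 shares.

    `available` maps a cloud index to the share fetched from that cloud.
    """
    missing = [i for i in range(k) if i not in available]
    if len(missing) > 1 or (missing and k not in available):
        raise ValueError("fewer than k shares available")
    shares = dict(available)
    if missing:
        # XOR of the k surviving shares (data and parity) restores the lost one.
        shares[missing[0]] = reduce(xor_bytes, available.values())
    return b"".join(shares[i] for i in range(k))


package = bytes(range(12))                        # stand-in for a CAONT package
clouds = dict(enumerate(disperse(package, k=3)))  # cloud i holds share i
survivors = {i: s for i, s in clouds.items() if i != 1}  # cloud 1 is down
restored = reconstruct(survivors, k=3)
```

Because disperse is deterministic, two identical packages always send the same share to the same cloud, which is the property the placement rule relies on for deduplication.
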

Decoding: To recover the secret, we retrieve any k out of
n shares and use them to reconstruct the original CAONT
package (Y, t). Then we deduce hash h by XOR’ing t
with H(Y) (see Equation (4)). Finally, we deduce secret
X by XOR’ing Y with G(h) (see Equation (2)), and remove
any padded zeroes introduced in encoding.

We can also verify the integrity of the deduced secret
X. We simply generate a hash value from the deduced
X as in Equation (1) and compare if it matches h. If the
match fails, then the decoded secret is considered to be
corrupted. To obtain a correct secret, we can follow a
brute-force approach, in which we try a different subset
of k shares until the secret is correctly decoded.
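
The transform in Equations (1) to (4), together with the decoding and integrity check, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: H is SHA-256 without a salt, and the generator G is replaced by SHA-256 in counter mode keyed by h, because the Python standard library has no AES; the paper's G encrypts a constant block with AES-256. The function names are ours, and Reed-Solomon coding is omitted.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Hash function H (SHA-256, unsalted in this sketch)."""
    return hashlib.sha256(data).digest()


def G(h: bytes, length: int) -> bytes:
    """Mask generator keyed by h.

    Assumption: SHA-256 in counter mode is a stand-in for E(h, C),
    the AES-256 encryption of a constant block described in the text.
    """
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(h + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def caont_encode(X: bytes) -> tuple:
    """Return the CAONT package (Y, t) for secret X."""
    h = H(X)                          # Equation (1)
    Y = xor_bytes(X, G(h, len(X)))    # Equation (2), with the stand-in G
    t = xor_bytes(h, H(Y))            # Equation (4)
    return Y, t


def caont_decode(Y: bytes, t: bytes) -> bytes:
    """Recover X from (Y, t) and verify its integrity against h."""
    h = xor_bytes(t, H(Y))
    X = xor_bytes(Y, G(h, len(Y)))
    if H(X) != h:
        raise ValueError("integrity check failed: decoded secret is corrupted")
    return X


X = b"backup chunk contents " * 10
Y, t = caont_encode(X)
```

Encoding the same secret twice yields the same package, which is what makes the resulting shares deduplicable; a caller that catches the ValueError could retry with a different subset of k shares, as described above.
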

Remarks: We briefly discuss the security properties of
CAONT-RS. CAONT-RS ensures confidentiality against
outsider attacks, provided that an attacker cannot gain
unauthorized accesses to k out of n clouds, and ensures
integrity through the embedded hash in each secret. It
leverages AONT to ensure that no information of the
original secret can be inferred from fewer than k shares.
We note that an attacker can identify the deduplication
status of the shares of different users and perform brute-force
dictionary attacks inside the clouds, and we
require that the secrets be drawn from a large message
space (see §3.1). To mitigate brute-force attacks, we may
replace the hash key in CAONT-RS with a more sophisticated
key generated by a key server, with the trade-off
of introducing the key management overhead.

3.3 Two-Stage Deduplication 

We first overview how deduplication works. Deduplication
divides data into fixed-size or variable-size chunks.
This work assumes variable-size chunking, which defines
boundaries based on content and is robust to content
shifting. Each chunk is uniquely identified by a fingerprint
computed by a cryptographic hash of the chunk
content. Two chunks are said to be identical if their fingerprints
are the same, and fingerprint collisions of two
different chunks are very unlikely in practice. Deduplication
stores only one copy of a chunk, and refers any
duplicate chunks to the copy via small-size references.
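
The overview above can be made concrete with a small Python sketch of a fingerprint-indexed chunk store. It is only an illustration: it uses fixed-size chunks of 4 bytes to keep the example short (the paper assumes variable-size chunking), and the class name ChunkStore and its methods are hypothetical.

```python
import hashlib


class ChunkStore:
    """Keeps one physical copy per distinct chunk, keyed by fingerprint."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> the single stored copy

    def write(self, data: bytes, chunk_size: int = 4) -> list:
        """Store data and return its recipe (the list of chunk fingerprints)."""
        recipe = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)   # duplicates are not stored again
            recipe.append(fp)                   # small-size reference to the copy
        return recipe

    def read(self, recipe: list) -> bytes:
        """Reassemble the data from the referenced chunks."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = ChunkStore()
backup_v1 = store.write(b"AAAABBBBAAAA")
backup_v2 = store.write(b"AAAABBBBCCCC")
```

The two backups reference six chunks in total, but only three distinct chunks are physically stored.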

To realize deduplication in cloud storage, a naive approach
is to perform global deduplication on the client
side. Specifically, before a user uploads data to a cloud,
it first generates fingerprints of the data. It then checks
with the cloud by fingerprint for the existence of any duplicate
data that has been uploaded by any user. Finally,
it uploads only the unique data to the cloud. Although
client-side global deduplication saves upload bandwidth
and storage overhead, it is susceptible to side-channel
attacks. One side-channel attack is to infer the
existence of data of other users. Specifically, an attacker
generates the fingerprints of some possible data of
other users and queries the cloud by fingerprint if such
data is unique and needs to be uploaded. If no upload
is needed, then the attacker infers that other users own
the data. Another side-channel attack is to gain unauthorized
access to data of other users. Specifically, an
attacker uses the fingerprints of some sensitive data of
other users to convince the cloud of the data ownership.


To prevent side-channel attacks, CDStore adopts two-stage
deduplication, which eliminates duplicates first on
the client side and then on the server side. We require
that each CDStore server maintains a deduplication index
that keeps track of which shares have been stored by
each user and how shares are deduplicated (see implementation
details in §4.4). Then the two deduplication
stages are implemented as follows.


Intra-user deduplication: A CDStore client first runs 
deduplication only on the data owned by the same user, 
and uploads the unique data of the user to the cloud. 
Before uploading shares to a cloud, the CDStore client 
first checks with the CDStore server by fingerprint if it 
has already uploaded the same shares. Specifically, the 
CDStore client first sends the fingerprints generated from 
the shares to the CDStore server. The CDStore server 
then looks up its deduplication index, and replies to the 
CDStore client a list of share identifiers that indicate 
which shares have been uploaded by the CDStore client. 
Finally, the CDStore client uploads only unique shares to 
the cloud based on the list. 


Inter-user deduplication: A CDStore server runs deduplication
on the data of all users and stores the globally
unique data in the cloud storage backend. After the
CDStore server receives shares from the CDStore client,
it generates a fingerprint from each share (instead of using
the one generated by the CDStore client for intra-user
deduplication), and checks if the share has already
been stored by other users by looking up the deduplication
index. It stores only the unique shares that are
not yet stored at the cloud backend. It also updates the
deduplication index to keep track of which user owns the
shares. Here, we cannot directly use the fingerprint generated
by the CDStore client for intra-user deduplication.
Otherwise, an attacker can launch a side-channel attack,
by using the fingerprint of a share of other users to gain
unauthorized access to the share.

Remarks: Two-stage deduplication prevents side-channel
attacks by making deduplication patterns independent
across users’ uploads. Thus, a malicious insider
cannot infer the data content of other users through deduplication
occurrences.
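
The two stages can be sketched in Python as follows. This is an illustrative model, not the prototype's code: the class ServerSketch, its methods, and client_upload are hypothetical names, the deduplication index is an in-memory dictionary, and the cloud backend is another dictionary.

```python
import hashlib


def fingerprint(share: bytes) -> str:
    return hashlib.sha256(share).hexdigest()


class ServerSketch:
    """Hypothetical CDStore server holding the deduplication index."""

    def __init__(self):
        self.backend = {}   # fingerprint -> share (globally unique copies)
        self.owners = {}    # fingerprint -> set of user identifiers

    def query_own(self, user: str, fingerprints: list) -> list:
        """Stage 1 lookup: report only shares that this user already uploaded."""
        return [fp for fp in fingerprints if user in self.owners.get(fp, set())]

    def receive(self, user: str, shares: list) -> None:
        """Stage 2: recompute each fingerprint and store globally new shares."""
        for share in shares:
            fp = fingerprint(share)          # never trust a client-supplied value
            if fp not in self.backend:
                self.backend[fp] = share
            self.owners.setdefault(fp, set()).add(user)


def client_upload(server: ServerSketch, user: str, shares: list) -> int:
    """Intra-user deduplication; returns how many shares were actually sent."""
    fps = [fingerprint(s) for s in shares]
    known = set(server.query_own(user, fps))
    to_send = []
    for share, fp in zip(shares, fps):
        if fp not in known:
            known.add(fp)                    # also skip repeats within this upload
            to_send.append(share)
    server.receive(user, to_send)
    return len(to_send)


server = ServerSketch()
a, b, c = b"share-a", b"share-b", b"share-c"
sent_alice = client_upload(server, "alice", [a, b])
sent_bob = client_upload(server, "bob", [a, c])
```

In this model, the second user still uploads share a even though another user already stored it, so the upload pattern reveals nothing about other users, while the backend keeps only one copy.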

Both intra-user and inter-user deduplications effectively
remove duplicates. Intra-user deduplication eliminates
duplicates of the same user’s data. This is effective
for backup workloads, since the same user often makes
repeated backups of the same data as different versions.
Inter-user deduplication further removes duplicates
of multiple users. For example, multiple users within the
same organization may share a large proportion of business
files. Some workloads exhibit large proportions of
duplicates across different users’ data, such as VM images,
workstation file system snapshots, and
backups [58]. The removal of duplicates translates to
cost savings (see §5.6).

4 CDStore Implementation 

We present the implementation details of CDStore. Our
CDStore prototype is written in C++ on Linux. We
use OpenSSL to implement cryptographic operations:
AES-256 and SHA-256 for the encryption and
hash algorithms of convergent dispersal, respectively,
and SHA-256 for fingerprints in deduplication. We use
GF-Complete [48] to accelerate Galois Field arithmetic
in the Reed-Solomon coding of CAONT-RS.

4.1 Architectural Overview 

We follow a modular approach to implement CDStore,
whose client and server architectures are shown in Figure 4.
During file uploads, a CDStore client splits the
file into a sequence of secrets via the chunking module.
It then encodes each secret into n shares via the coding
module. It performs intra-user deduplication, and uploads
unique shares to the CDStore servers in n different
clouds via both client-side and server-side communication
modules. To reduce network I/Os, we avoid sending
many small-size shares over the Internet. Instead, we first
batch the shares to be uploaded to each cloud in a 4MB
buffer and upload the buffer when it is full. Upon receiving
the shares, each CDStore server performs inter-user
deduplication via the deduplication module and updates
the deduplication metadata via the index module. Finally,
it packs the unique shares as containers and writes the
containers to the cloud storage backend through the internal
network via the container module.
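
The batching of small shares into a per-cloud buffer can be sketched as follows. The class ShareBatcher and its send callback are hypothetical, and the sketch flushes just before a share would overflow the buffer, which is one possible reading of "upload the buffer when it is full".

```python
class ShareBatcher:
    """Collects small shares for one cloud and uploads them in batches."""

    def __init__(self, send, capacity=4 * 1024 * 1024):
        self.send = send            # hypothetical callback that uploads one batch
        self.capacity = capacity    # buffer size in bytes (4MB by default)
        self.batch = []
        self.size = 0

    def add(self, share: bytes) -> None:
        # Flush first if this share would not fit into the current buffer.
        if self.batch and self.size + len(share) > self.capacity:
            self.flush()
        self.batch.append(share)
        self.size += len(share)

    def flush(self) -> None:
        """Upload whatever is buffered (also called at the end of a file)."""
        if self.batch:
            self.send(self.batch)
            self.batch, self.size = [], 0


sent = []                                   # records each uploaded batch
batcher = ShareBatcher(sent.append, capacity=10)
for share in [b"aaaa", b"bbbb", b"cccc"]:
    batcher.add(share)
batcher.flush()
```

With a 10-byte capacity, the three 4-byte shares leave the client in two uploads instead of three.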

File downloads work in the reverse way. A CDStore
client connects to any k clouds to request to download
a file. Each CDStore server retrieves the corresponding
containers and metadata, and returns all required shares
and file metadata. The CDStore client decodes the secrets
and assembles the secrets back to the file.

Figure 4: Implementation of the CDStore architecture.


4.2 Chunking 

We implement both fixed-size chunking and variable-size
chunking in the chunking module of a CDStore
client, and enable variable-size chunking by default. To
make deduplication effective, the size of each secret
should be on the order of kilobytes (e.g., 8KB). We
implement variable-size chunking based on Rabin fingerprinting,
in which the average, minimum, and maximum
secret (chunk) sizes are configured at 8KB, 2KB,
and 16KB, respectively.
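
A content-defined chunker with the size limits above can be sketched in Python. As an assumption of this sketch, a polynomial rolling hash over a 48-byte window stands in for Rabin fingerprinting, and the 8KB "average" is only the target of the boundary mask, not a measured mean; the window size, base, and modulus are arbitrary illustrative choices.

```python
import random

WINDOW = 48              # bytes covered by the rolling hash (illustrative)
BASE = 257
MOD = (1 << 61) - 1


def variable_size_chunks(data: bytes, avg=8192, min_size=2048, max_size=16384):
    """Cut data where the rolling hash of the last WINDOW bytes hits the mask.

    avg must be a power of two; a boundary is declared when the low bits of
    the hash are all zero, subject to the minimum and maximum chunk sizes.
    """
    mask = avg - 1
    top = pow(BASE, WINDOW - 1, MOD)    # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        pos = i - start                 # offset inside the current chunk
        if pos >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # take in the new byte
        size = pos + 1
        if (size >= min_size and (h & mask) == 0) or size == max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])     # trailing chunk may be shorter
    return chunks


rng = random.Random(0)                  # fixed seed for a repeatable demo
data = bytes(rng.getrandbits(8) for _ in range(100_000))
chunks = variable_size_chunks(data)
```

Since the minimum chunk size exceeds the window, every boundary decision depends only on the last 48 bytes of content, which is what lets boundaries realign after an insertion or deletion earlier in the file.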


4.3 Metadata Offloading 

One important reliability requirement is to tolerate
client-side failures, as we expect that a CDStore client
is deployed in commodity hardware. Thus, our current
implementation makes CDStore servers keep and manage
all metadata on behalf of CDStore clients.

When uploading a file, a CDStore client collects two 
types of metadata. First, after chunking, it collects file 
metadata for the upload file, including the full pathname, 
file size, and number of secrets. Second, after encoding 
a secret into shares, it collects share metadata for each 
share, including the share size, fingerprint of the share 
(for intra-user deduplication), sequence number of the 
input secret, and secret size (for removing padded zeroes 
when decoding the original secret). 
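
The two metadata types listed above could be modeled as simple records, as in the following sketch. The field names are ours, and the helper strip_padding assumes the zero padding is appended at the end of the secret, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    pathname: str        # full pathname of the uploaded file
    file_size: int       # file size in bytes
    num_secrets: int     # number of secrets produced by chunking


@dataclass(frozen=True)
class ShareMetadata:
    share_size: int      # size of this share in bytes
    fingerprint: bytes   # share fingerprint, used for intra-user deduplication
    secret_seq: int      # sequence number of the input secret
    secret_size: int     # original secret size, used to drop padded zeroes


def strip_padding(decoded: bytes, meta: ShareMetadata) -> bytes:
    """Remove padded zeroes (assumed to be trailing) after decoding."""
    return decoded[:meta.secret_size]


meta = ShareMetadata(share_size=4, fingerprint=b"\x00" * 32,
                     secret_seq=0, secret_size=10)
secret = strip_padding(b"0123456789\x00\x00", meta)
```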

The CDStore client uploads the file and share metadata
to the CDStore servers along with the uploaded file. The
metadata will serve as input for each CDStore server to
maintain index information (see §4.4).

We distribute metadata across all CDStore servers for 
reliability. For non-sensitive information (e.g., the size 
and sequence number of each secret), we can simply 
replicate it, so that each CDStore server can directly use 
it to manage data transfer and deduplication. However, 
for sensitive information (e.g., a file’s full pathname), we 
encode and disperse it via secret sharing. 


4.4 Index Management

Each CDStore server uses the metadata from CDStore
clients to generate index information of the uploaded
files and keep it in the index module. There are two types
of index structures: the file index and the share index.

The file index holds the entries for all files uploaded
by different users. Each entry describes a file, identified
by the full pathname (which has been encoded as
described in §4.3) and the user identifier provided by a
CDStore client. We hash the full pathname and the user
identifier to obtain a unique key for the entry. The entry
stores a reference to the file recipe, which describes the
complete details of the file, including the fingerprint of
each share (for retrieving the share) and the size of the
corresponding secret (for decoding the original secret).
The file recipe will be saved at the cloud backend by the
container module (see §4.5).

The share index holds the entries for all unique shares
of different files. Each entry describes a share, and is
keyed by the share fingerprint. It stores the reference
to the container that holds the share. To support intra-user
deduplication, each entry also holds a list of user
identifiers to distinguish who owns the share, as well as
a reference count for each user to support deletion.

Our prototype manages file and share indices using
LevelDB, an open-source key-value store. LevelDB
maintains key-value pairs in a log-structured merge
(LSM) tree [44], which supports fast random inserts, updates,
and deletes, and uses a Bloom filter and a
block cache to speed up lookups. We can also leverage
the snapshot feature provided by LevelDB to store periodic
snapshots in the cloud backend for reliability. We
currently do not consider this feature in our evaluation.

4.5 Container Management

The container module maintains two types of containers
in the storage backend: share containers, which hold
the globally unique shares, and recipe containers, which
hold the file recipes of different files. We cap the container
size at 4MB, except that if a file recipe is very
large (due to a particularly large file), we keep the file
recipe in a single container and allow the container to go
beyond 4MB. We avoid splitting a file recipe in multiple
containers to reduce I/Os.

We make two optimizations to reduce the I/O overhead
of storing and fetching the containers via the storage
backend. First, we maintain in-memory buffers for holding
shares and file recipes before writing them into containers.
We organize the shares or file recipes by users,
so that each container contains only the data of a single
user. This retains spatial locality of workloads. Second,
we maintain a least-recently-used (LRU) disk cache
to hold the most recently accessed containers to reduce
I/Os to the storage backend.


4.6 Multi-Threading 

Advances of multi-core architectures enable us to exploit
multi-threading for parallelization. First, the client-side
coding module uses multi-threading for the CPU-intensive
encoding/decoding operations of CAONT-RS.
We parallelize encoding/decoding at the secret level: in
file uploads, we pass each secret output from the chunking
module to one of the threads for encoding; in file
downloads, we pass the shares of a secret received by the
communication module to a thread for decoding.

Furthermore, both client-side and server-side communication
modules use multi-threading to fully utilize the
network transfer bandwidth. The client-side communication
module creates multiple threads, one for each cloud,
to upload/download shares. The server-side communication
module also uses multiple threads to send/receive
shares for different CDStore clients.
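
The threading layout described above can be sketched in Python with a thread pool for secret-level encoding and one thread per cloud for transfers. Everything here is a placeholder: encode_secret is a hypothetical stand-in that returns n = 4 hash-derived byte strings rather than real CAONT-RS shares, and "uploading" merely appends to an in-memory list per cloud.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

N_CLOUDS = 4


def encode_secret(secret: bytes) -> list:
    """Placeholder encoder (NOT CAONT-RS): one pseudo-share per cloud."""
    return [hashlib.sha256(bytes([i]) + secret).digest() for i in range(N_CLOUDS)]


def encode_all(secrets: list, threads: int = 2) -> list:
    """Encode secrets in parallel; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(encode_secret, secrets))


def upload_all(encoded: list) -> list:
    """One thread per cloud; thread i handles share i of every secret."""
    clouds = [[] for _ in range(N_CLOUDS)]

    def upload(i: int) -> None:
        for shares in encoded:
            clouds[i].append(shares[i])     # each thread writes only its own list

    with ThreadPoolExecutor(max_workers=N_CLOUDS) as pool:
        list(pool.map(upload, range(N_CLOUDS)))
    return clouds


secrets = [b"secret-0", b"secret-1", b"secret-2"]
encoded = encode_all(secrets)
clouds = upload_all(encoded)
```

Note that CPython threads give limited speedup for CPU-bound pure-Python code; the sketch shows the structure only, whereas the prototype described here is written in C++.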

4.7 Open Issues 

Our current CDStore prototype implements the basic 
backup and restore operations. We discuss some open 
implementation issues. 

Storage efficiency: We can reclaim more storage space
via different techniques in addition to deduplication. For
example, garbage collection can reclaim space of expired
backups. By exploiting historical information, we
can accelerate garbage collection in deduplication storage
[25]. Compression also effectively reduces storage
space of both data [58] and metadata (e.g., file recipes
pTI). Implementations of garbage collection and compression
are posed as future work.

Scalability: We currently deploy one CDStore server per
cloud. In large-scale deployment, we can run CDStore
servers on multiple VMs per cloud and evenly distribute
user backup jobs among them for load balance. Implementing
a distributed deduplication system is beyond the
scope of this paper.

Consistency: Our prototype is tailored for backup workloads
that are immutable. We do not address consistency
issues due to concurrent updates as mentioned in 0. 

5 Evaluation 

We evaluate CDStore under different testbeds and workloads.
We also analyze its monetary cost advantages.

5.1 Testbeds 

We consider three types of testbeds in our evaluation. 

(i) Local machines: We use two machines: Local-Xeon,
which has a quad-core 2.4GHz Intel Xeon E5530
and 16GB RAM, and Local-i5, which has a quad-core
3.4GHz Intel Core i5-3570 and 8GB RAM. Both machines
run 64-bit Ubuntu 12.04.2 LTS. We use them to
evaluate the encoding performance of CDStore clients.

(ii) LAN: We configure a LAN of multiple machines
with the same configuration as Local-i5. All nodes are
connected via a 1Gb/s switch. We run CDStore clients
and servers on different machines. Each CDStore server
mounts the storage backend on a local 7200RPM SATA
hard disk. We use the LAN testbed to evaluate the data
transfer performance of CDStore.

(iii) Cloud: We deploy a CDStore client on the Local-Xeon
machine (in Hong Kong) and connect it via the Internet
to four commercial clouds (i.e., n = 4): Amazon
(in Singapore), Google (in Singapore), Azure (in
Hong Kong), and Rackspace (in Hong Kong). We set up
the testbed in the same continent to limit the differences
among the client-to-server connection bandwidths. Each
cloud runs a VM with similar configurations: four CPU
cores and 4~15GB RAM. We use the cloud testbed to
evaluate the real deployment performance of CDStore.

5.2 Datasets 

We use two real-world datasets to drive our evaluation. 

(i) FSL: This dataset is published by the File systems
and Storage Lab (FSL) at Stony Brook University pp7| .
Due to the large dataset size, we use the Fslhomes
dataset in 2013, containing daily snapshots of nine students’
home directories from a shared network file system.
We select the snapshots every seven days (which are
not continuous) to mimic weekly backups. The dataset is
represented in 48-bit chunk fingerprints and corresponding
chunk sizes obtained from variable-size chunking.
Our filtered FSL dataset contains 16 weekly backups of
all nine users, covering a total of 8.11TB of data.

(ii) VM: This dataset is collected by ourselves and 
is unpublished. It consists of weekly snapshots of 156 
VM images for students in a university programming 
course in Spring 2014. We create a 10GB master image 
with Ubuntu 12.04.2 LTS and clone all VMs. We treat 
each VM image snapshot as a weekly backup of a user. 
The dataset is represented in SHA-1 fingerprints on 4KB 
fixed-size chunks. It spans 16 weeks, totaling 24.38TB 
of data. For fair comparisons, we remove all zero-filled 
chunks (which dominate in VM images | [3T| ) from the 
dataset, and the size reduces to 11.12TB. 

5.3 Encoding Performance 

We evaluate the computational overhead of CAONT-RS
when encoding secrets into shares. We compare
CAONT-RS with two variants: (i) AONT-RS | [5^ , which
builds on Rivest’s AONT [5^ and does not support deduplication,
and (ii) our prior proposal CAONT-RS-Rivest
ITtI , which uses Rivest’s AONT as in AONT-RS and
replaces the random key in AONT-RS with a SHA-256
hash for convergent dispersal. CAONT-RS uses OAEP-based
AONT instead (see §3.2).

We conduct our experiments on the Local-Xeon and 
Local-i5 machines. We create 2GB of random data in 
memory (to remove I/O overhead), generate secrets using 
variable-size chunking with an average chunk size 8KB, 
and encode them into shares. We measure the encoding 
speed, defined as the ratio of the original data size to the 
total time of encoding all secrets into shares. Our results
are averaged over 10 runs. We observe similar results for
decoding, and omit them here.

Figure 5: Encoding speeds of a CDStore client.

We first examine the benefits of multi-threading (see
§4.6). Figure 5(a) shows the encoding speeds versus the
number of threads, while we fix (n, k) = (4, 3). The
encoding speeds of all schemes increase with the number
of threads. If two encoding threads are used, the
encoding speeds of CAONT-RS are 83MB/s on Local-Xeon
and 183MB/s on Local-i5. Also, OAEP-based
AONT in CAONT-RS brings remarkable performance
gains. Compared to CAONT-RS-Rivest, which performs
encryptions on small words based on Rivest’s AONT,
CAONT-RS improves the encoding speed by 40~61%
on Local-Xeon and 54~61% on Local-i5; even
compared to AONT-RS, which uses one fewer hash operation,
CAONT-RS still increases the encoding speed by
12~35% on Local-Xeon and 19~27% on Local-i5.

We next evaluate the impact of n (number of clouds).
We vary n from 4 to 20, and fix two encoding threads.
We configure k as the largest integer that satisfies k/n ≤ 3/4
(e.g., n = 4 implies k = 3), so as to maintain a similar
storage blowup due to secret sharing. Figure 5(b) shows
the encoding speeds versus n. The encoding speeds of
all schemes slightly decrease with n (e.g., by 8% from
n = 4 to 20 for CAONT-RS on Local-i5), since more
encoded shares are generated via Reed-Solomon codes
for a larger n. However, Reed-Solomon coding only
accounts for small overhead compared to AONT, which
runs cryptographic operations. We have also tested other
ratios of k/n and obtained similar speed results.
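
The choice of k for a given n can be sketched as below. The sketch assumes that the bound on k/n is 3/4, which is consistent with the example n = 4, k = 3 in the text; the function name choose_k is hypothetical.

```python
def choose_k(n, num=3, den=4):
    """Largest integer k with k/n <= num/den (assumed bound: 3/4)."""
    return (num * n) // den


if __name__ == "__main__":
    for n in range(4, 21, 4):
        k = choose_k(n)
        # n/k is the storage blowup introduced by secret sharing.
        print(f"n={n:2d}  k={k:2d}  blowup n/k={n / k:.3f}")
```

When n is a multiple of 4 the blowup n/k is exactly 4/3; for other values of n it is slightly larger (e.g., n = 7 gives k = 5).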

The above results only report encoding speeds, while
a CDStore client performs both chunking and encoding
operations when uploading data to multiple clouds.
We measure the combined chunking (using variable-size
chunking) and encoding speeds with (n, k) = (4, 3) and
two encoding threads, and find that the combined speeds
drop by around 16%, to 69MB/s on Local-Xeon and
154MB/s on Local-i5.

5.4 Deduplication Efficiency 

We evaluate the effectiveness of both intra-user and inter-user
deduplications (see §3.3). We extract the deduplication
characteristics of both datasets, assuming that they
are stored as weekly backups. We define four types of
data: (i) logical data, the original user data to be encoded
into shares, (ii) logical shares, the shares before two-stage
deduplication, (iii) transferred shares, the shares
that are transferred over the Internet after intra-user
deduplication, and (iv) physical shares, the shares that are
finally stored after two-stage deduplication. We also define two
metrics: (i) intra-user deduplication saving, which is one
minus the ratio of the size of the transferred shares to
that of the logical shares, and (ii) inter-user deduplication
saving, which is one minus the ratio of the size of
the physical shares to that of the transferred shares. We
fix (n, k) = (4, 3). Figure 6 summarizes the results.
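
The two metrics defined above translate directly into code. The sketch below uses hypothetical function names, and the sizes in the usage note are illustrative values rather than measurements from the datasets.

```python
def dedup_savings(logical_shares, transferred_shares, physical_shares):
    """Return (intra-user saving, inter-user saving) as fractions."""
    intra = 1 - transferred_shares / logical_shares
    inter = 1 - physical_shares / transferred_shares
    return intra, inter


def overall_saving(logical_shares, physical_shares):
    """Saving of two-stage deduplication as a whole."""
    return 1 - physical_shares / logical_shares
```

For instance, with illustrative sizes of 100, 10, and 8 units for logical, transferred, and physical shares, the intra-user saving is 90% and the inter-user saving is 20%. The two stages compose multiplicatively: the overall saving equals 1 - (1 - intra)(1 - inter), which is 92% in this example.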

Figure 6(a) first shows the intra-user and inter-user
deduplication savings. The intra-user deduplication savings
are very high for both datasets, especially in subsequent
backups after the first week (at least 94.2% for FSL
and at least 98.0% for VM). The reason is that the users
only modify or add a small portion of data. The savings
translate to performance gains in file uploads (see
§5.5). However, the inter-user deduplication savings differ
across datasets. For the FSL dataset, the savings fall
to no more than 12.9%. In contrast, for the VM dataset,
the saving for the first backup reaches 93.4%, mainly because
the VM images are initially installed with the same
operating system. The savings for subsequent backups
then drop to the range between 11.8% and 47.0%. Nevertheless,
the VM dataset shows higher savings for subsequent
backups than the FSL dataset; we conjecture the
reason is that students make similar changes to the VM
images when doing programming assignments.

Figure 6(b) then shows cumulative data and share sizes
before and after intra-user and inter-user deduplications.
After 16 weekly backups, for the FSL dataset, the total
size of physical shares is only 0.51TB, about 6.3% of the
logical data size; for the VM dataset, the total size of
physical shares is only 0.09TB, about 0.8% of the logical
data size. This shows that dispersal-level redundancy
(i.e., n/k = 4/3) is significantly offset by removing content-level
redundancy via two-stage deduplication. Also, if
we compare the sizes of transferred shares and physical
shares for the VM dataset, we see that inter-user deduplication
is crucial for reducing storage space.
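
The percentages above can be checked against the dataset sizes given in §5.2 (8.11TB for FSL and 11.12TB for VM after removing zero-filled chunks). The second line is a rough estimate of the logical share sizes under the assumption that secret sharing inflates the logical data by exactly n/k = 4/3, ignoring any small per-secret overhead.

```latex
\[
\frac{0.51\,\text{TB}}{8.11\,\text{TB}} \approx 6.3\%,
\qquad
\frac{0.09\,\text{TB}}{11.12\,\text{TB}} \approx 0.8\%
\]
\[
\frac{n}{k}\times 8.11\,\text{TB} = \frac{4}{3}\times 8.11\,\text{TB} \approx 10.81\,\text{TB},
\qquad
\frac{4}{3}\times 11.12\,\text{TB} \approx 14.83\,\text{TB}
\]
```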


5.5 Transfer Speeds 

Single-client baseline transfer speeds: We first evaluate
the baseline transfer speed of a CDStore client using
both LAN and cloud testbeds. Each testbed has one
CDStore client and four CDStore servers with (n, k) =
(4, 3). We first upload 2GB of unique data (i.e., no duplicates),
then upload another 2GB of duplicate data identical
to the previous one, and finally download the 2GB
data from three CDStore servers (for the cloud testbed,
we choose Google, Azure, and Rackspace for downloads).
We measure the upload and download speeds,
averaged over 10 runs.

Figure 7(a) presents the results. On the LAN testbed,
the upload speed for unique data is 77MB/s. Our measurements
find that the effective network speed in our
LAN testbed is around 110MB/s. Thus, the upload speed
for unique data is close to k/n of the effective network
speed. Uploading duplicate data achieves 150MB/s.
Since it does not transfer actual data after intra-user
deduplication, the performance is bounded by the chunking
and CAONT-RS encoding operations (see §5.3). The
download speed is 99MB/s, about 10% less than the effective
network speed. The reason is that the CDStore
servers need to retrieve data from the disk backend before
returning it to the CDStore client.
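
A back-of-the-envelope check of the unique-data upload speed is sketched below. It assumes that the client's own network link is the bottleneck and that each of the n shares is about 1/k of the secret size, so every byte of user data puts n/k bytes on the wire; the function name is hypothetical.

```python
def unique_upload_bound(link_mb_s, n, k):
    """Upper bound on user-data upload speed when n/k bytes are sent per data byte."""
    return link_mb_s * k / n


bound = unique_upload_bound(110, n=4, k=3)   # 82.5 MB/s
fraction_achieved = 77 / bound               # measured 77MB/s relative to the bound
```

Under these assumptions the bound is 82.5MB/s, so the measured 77MB/s reaches roughly 93% of it.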

On the cloud testbed, the upload and download performance
is limited by the Internet bandwidth. For reference,
we measure the upload and download speeds
of each individual cloud when transferring 2GB of
unique data divided in 4MB units (see §4.1), and Table 2
presents the averaged results over 10 runs. Since
CDStore transfers data through multiple clouds in parallel
via multi-threading, its upload speed of unique data
and download speed are higher than those of individual
clouds (e.g., Amazon and Google). The upload speed for
unique data is smaller than the download speed because
of sending redundancy and connecting to more clouds.
The upload speed for duplicate data is over 9× that for
unique data, and this difference is more significant than
on the LAN testbed.

Cloud        Upload speed     Download speed
Amazon       5.87 (0.19)      4.45 (0.30)
Google       4.99 (0.23)      4.45 (0.21)
Azure        19.59 (1.20)     13.78 (0.72)
Rackspace    19.42 (1.06)     12.93 (1.47)

Table 2: Measured speeds (MB/s) of each of four clouds,
in terms of the average (standard deviation) over 10 runs.

Single-client trace-driven transfer speeds: We now
evaluate the upload and download speeds of a single
CDStore client using datasets as opposed to unique and
duplicate data above. We focus on the FSL dataset,
which allows us to test the effect of variable-size chunking.
We again consider both LAN and cloud testbeds
with (n, k) = (4, 3). Since the FSL dataset only has
chunk fingerprints and chunk sizes, we reconstruct a
chunk by writing the fingerprint value repeatedly to a
chunk with the specified size, so as to preserve content
similarity. Each chunk is treated as a secret, which will
be encoded into shares. We first upload all backups to
CDStore servers, followed by downloading them. To reduce
evaluation time, we only run part of the dataset. On
the LAN testbed, we run seven weekly backups for five
users (1.06TB data in total). We feed the first week of
backups of each user one by one through the CDStore
client, followed by the second week of backups, and so
on. On the other hand, on the cloud testbed, we run two
weekly backups for a single user (21.35GB data in total).
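
The chunk reconstruction step described above can be sketched as follows. The function name is hypothetical, and truncating the last repetition so that the chunk has exactly the specified size is an assumption of this sketch.

```python
def reconstruct_chunk(fingerprint: bytes, size: int) -> bytes:
    """Fill a chunk of 'size' bytes by repeating the fingerprint value."""
    if not fingerprint:
        raise ValueError("fingerprint must be non-empty")
    repetitions = -(-size // len(fingerprint))   # ceiling division
    return (fingerprint * repetitions)[:size]


# A 48-bit (6-byte) fingerprint, as in the FSL traces; the value is made up.
fp = bytes.fromhex("a1b2c3d4e5f6")
chunk = reconstruct_chunk(fp, 8192)
```

Two trace entries with the same fingerprint and size yield byte-identical chunks, so the reconstructed data deduplicates exactly as the original trace would.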

Figure 7(b) presents three results: (i) the average upload
speed for the first backup (averaged over five users
for the LAN testbed), (ii) the average upload speed for
the subsequent backups, and (iii) the average download
speed of all backups. The presented results are obtained
from a single run, yet the evaluation time is long enough
to give steady-state results. We compare the results with
those for unique and duplicate data in Figure 7(a).

We see that the upload speed for the first backup exceeds
that for unique data (e.g., by 19% on the LAN
testbed), mainly because the first backup contains duplicates,
which can be removed by intra-user deduplication
(see Figure 6). The upload speed for the subsequent
backups approaches that for duplicate data, as most
duplicates are again removed by intra-user deduplication.

The trace-driven download speed is lower than the
baseline one in Figure 7(a) (e.g., by 10% on the LAN
testbed), since deduplication now introduces chunk fragmentation
for subsequent backups. Nevertheless,
we find that the variance of the download speeds of the
backups is very small (not shown in the figure), although
the number of accessed containers increases for subsequent
backups. The download speed will gradually degrade
due to fragmentation as we store more backups.
We do not explicitly address fragmentation in this work.

Figure 7: Upload and download speeds of a CDStore client (the numbers are
the speeds in MB/s).

Figure 8: Aggregate upload speeds of multiple CDStore clients.

Multi-client aggregate upload speeds: We evaluate the
aggregate upload speed when multiple CDStore clients
connect to multiple CDStore servers. We mainly consider
data uploads on the LAN testbed, in which we vary
the number of CDStore clients, each hosted on a dedicated
machine, and configure four CDStore servers with
(n, k) = (4, 3). All CDStore clients perform uploads
concurrently, such that each of them first uploads 2GB
of unique data, and then uploads another 2GB of duplicate
data. We measure the aggregate upload speed, defined
as the total upload size (i.e., 2GB times the number
of clients) divided by the duration when all clients finish
uploads. Our results are averaged over 10 runs.
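
The definition of aggregate upload speed can be written down as a small helper. The function name and the finish times in the usage note are illustrative assumptions, and the sketch reads "the duration when all clients finish" as the time at which the slowest client completes.

```python
def aggregate_upload_speed(per_client_mb, finish_times_s):
    """Total uploaded size divided by the time at which the last client finishes."""
    total_mb = per_client_mb * len(finish_times_s)
    return total_mb / max(finish_times_s)
```

For example, four clients that each upload 2048MB and finish after 30, 31, 32, and 32 seconds (made-up values) yield an aggregate speed of 8192MB / 32s = 256MB/s.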

Figure 8 presents the aggregate upload speeds for
both unique and duplicate data, which we observe increase
with the number of CDStore clients. For unique
data, the aggregate upload speed reaches 282MB/s for
eight CDStore clients. The speed is limited by the network
bandwidth and disk I/O, where the latter is for the
CDStore servers to write containers to disk. If we exclude
disk I/O (i.e., without writing data), the aggregate
upload speed can reach 310MB/s (not shown in the figure),
which approaches the aggregate effective Ethernet
speed of k = 3 CDStore servers. For duplicate
data, there is no actual data transfer, so the aggregate upload
speed can reach 572MB/s. Note that the knee point
at four CDStore clients is due to the saturation of CPU
resources in each CDStore server.

5.6 Cost Analysis 

We now analyze the cost saving of CDStore. We compare
it with two baseline systems: (i) an AONT-RS-based
multi-cloud system that has the same levels of reliability
and security as CDStore but does not support deduplication,
and (ii) a single-cloud system that incurs zero redundancy
for reliability, but encrypts user data with random
keys and does not support deduplication. We aim
to show that CDStore incurs less cost than AONT-RS
through deduplication; even though CDStore incurs redundancy
for reliability, it still incurs less cost than the
single-cloud system without deduplication.

We develop a tool to estimate the monetary costs using
the pricing models of Amazon EC2 and S3
in September 2014. No charges apply to data transfers
between co-locating EC2 instances and S3 storage,
or to inbound transfers to both EC2 and S3. We only
study backup operations, and do not consider restore operations
as they are relatively infrequent in practice. Note
that both EC2 and S3 follow tiered pricing, so the exact
charges depend on the actual usage. Our tool takes into
account tiered pricing in cost calculations. For CDStore,
we also consider the storage costs of file recipes.


We briefly describe how we derive the EC2 and S3
costs. For EC2, we consider the category of high-utilization
reserved instances, which are priced based on
an upfront fee and hourly bills. We focus on two types
of instances, namely compute-optimized and storage-optimized,
to host CDStore servers on all clouds. Each
instance costs around US$60~1,300 per month, depending
on the CPU, memory, and storage settings. Note
that both file and share indices (see §4.4) are kept in the
local storage of an EC2 instance, and the total index size
is determined by how much data is stored and how much
data can be deduplicated. Our tool chooses the cheapest
instance that can keep the entire indices according
to the storage size and deduplication efficiency, both of
which can be estimated in practice. On the other hand,
S3 storage is mainly priced based on storage size, and
it charges around US$30 per TB per month. Note that
in backup operations, the costs due to outbound transfer
(e.g., a CDStore server returns the intra-user deduplication
status to a CDStore client) and storage requests (e.g.,
PUT) are negligible compared to VM and storage costs.
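
The two ingredients of the cost derivation above, tiered storage pricing and picking the cheapest instance that fits the indices, can be sketched as follows. The tier boundaries, prices, and instance list in the usage note are hypothetical placeholders chosen for illustration; they are not Amazon's actual price list, and the function names are not those of the authors' tool.

```python
def tiered_cost(usage_tb, tiers):
    """Monthly cost under tiered pricing.

    tiers: list of (tier_size_tb, price_per_tb); a tier_size_tb of None
    means the tier is unbounded. Each tier is charged only for the
    portion of usage that falls into it.
    """
    cost, remaining = 0.0, usage_tb
    for size, price in tiers:
        portion = remaining if size is None else min(remaining, size)
        cost += portion * price
        remaining -= portion
        if remaining <= 0:
            break
    return cost


def cheapest_instance(index_gb, instances):
    """Cheapest instance whose local storage can hold the indices.

    instances: list of (name, local_storage_gb, monthly_usd).
    Returns None if no instance is large enough.
    """
    fitting = [inst for inst in instances if inst[1] >= index_gb]
    return min(fitting, key=lambda inst: inst[2]) if fitting else None
```

With hypothetical tiers [(1, 30.0), (49, 29.5), (None, 29.0)], storing 2TB costs 30.0 + 29.5 = 59.5 per month. With hypothetical instances [("small", 80, 60), ("medium", 160, 120), ("large", 800, 500)], indices of 100GB select "medium". Switching between instance types as the index grows is what produces step changes in the total cost.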


We consider a case study. An organization schedules
weekly backups for its user data, for a retention time
of half a year (26 weeks). We fix (n, k) = (4, 3) (i.e.,
we host four EC2 instances for CDStore servers). We
vary the weekly backup size and the deduplication ratio,
where the latter is defined as the ratio of the size of logical
shares to the size of physical shares (see §5.4).

Figure 9: Cost savings of CDStore over an AONT-RS-based
multi-cloud system and a single-cloud system.

Figure 9(a) shows the cost savings of CDStore versus
different weekly backup sizes, while we fix the
deduplication ratio as 10× p8| . The cost savings increase
with the weekly backup size. For example, if we
keep a weekly backup size of 16TB, the single-cloud
and AONT-RS-based systems incur total storage costs
(with tiered pricing) of around US$12,250/month and
US$16,400/month, respectively; CDStore incurs additional
VM costs of around US$660/month but reduces
the storage cost to around US$2,880/month, resulting in
around US$3,540/month in total and thus achieving at
least 70% of cost savings as a whole. The cost saving of
CDStore over AONT-RS is higher than that over a single
cloud, as the former introduces dispersal-level redundancy
for fault tolerance. The increase slows down as the
weekly backup size further increases, since the overhead
of file recipes becomes significant when the total backup
size is large while the backups have a high deduplication
ratio ED. Note that the jagged curves are due to the
switch of the cheapest EC2 instance to fit the indices.
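
The "at least 70%" figure for the 16TB case can be reproduced from the monthly costs quoted above. The helper name cost_saving is hypothetical; the dollar amounts are the approximate values stated in the text.

```python
def cost_saving(baseline_usd, cdstore_usd):
    """Fractional saving of CDStore relative to a baseline system."""
    return 1 - cdstore_usd / baseline_usd


cdstore_total = 660 + 2880                            # VM cost + storage cost
saving_vs_single = cost_saving(12250, cdstore_total)  # about 0.711
saving_vs_aont_rs = cost_saving(16400, cdstore_total) # about 0.784
```

The saving is about 71% over the single-cloud system and about 78% over the AONT-RS-based system, which matches the observation that the saving over AONT-RS is the higher of the two.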

Figure 9(b) shows the cost savings of CDStore versus
different deduplication ratios, where the weekly backup
size is fixed at 16TB. The cost saving increases with the
deduplication ratio. The saving is about 70~80% when
the deduplication ratio is between 10× and 50×.


6 Related Work 

Multi-cloud storage: Existing multi-cloud storage systems
mainly focus on data availability in the presence of
cloud failures and vendor lock-ins. For example, SafeStore,
RACS 0, Scalia ||4g, and NCCloud ||2|| disperse
redundancy across multiple clouds using RAID or
erasure coding. Some multi-cloud systems additionally
address security. HAIL |T9| proposes proof of retrievability
to support remote integrity checking against data
corruptions. MetaStorage (H and SPANStore [60] provide
both availability and integrity guarantees by replicating
data across multiple clouds using quorum techniques
[39], but do not address confidentiality. Hybris
| |2T| achieves confidentiality by dispersing encrypted
data over multiple public clouds via erasure coding and
keeping secret keys in a private cloud.

Applications of secret sharing: We discuss several secret
sharing algorithms in ^. They have been realized
by storage systems. POTSHARDS [56] realizes
Shamir’s scheme [54] for archival storage. ICStore
ED achieves confidentiality via key-based encryption,
where the keys are distributed across multiple clouds via
Shamir’s scheme. DepSky G3 and SCFS |T^ distribute
keys across clouds using SSMS p4| . Cleversafe | [5^
uses AONT-RS to achieve security with reduced storage
space. All the above systems rely on random inputs to
secret sharing, and do not address deduplication.


Deduplication security: Convergent encryption p4|
provides confidentiality guarantees for deduplication
storage, and has been adopted in various storage systems
0|7 ^ ^. However, the key management
overheads of convergent encryption are significant p6| .
Bellare et al. p0| generalize convergent encryption into
message-locked encryption (MLE) and provide formal
security analysis on confidentiality and tag consistency.
The same authors also prototype a server-aided MLE
system, DupLESS, which uses more complicated encryption
keys to prevent brute-force attacks. DupLESS
maintains the keys in a dedicated key server, yet the key
server is a single point of failure.


Client-side inter-user deduplication poses new security
threats, including the side-channel attack [27, 28] and
some specific attacks against Dropbox [43]. CDStore addresses
this problem through two-stage deduplication. A
previous work ED proposes a similar two-stage deduplication
approach (i.e., inner-VM and cross-VM deduplications)
to reduce system resources for VM backups,
while our approach is mainly to address security.


7 Conclusions 

We propose a multi-cloud storage system called CDStore
for organizations to outsource backup and archival storage
to public cloud vendors, with three goals in mind:
reliability, security, and cost efficiency. The core design
of CDStore is convergent dispersal, which augments
secret sharing with the deduplication capability.
CDStore also adopts two-stage deduplication to
achieve bandwidth and storage savings and prevent side-channel
attacks. We extensively evaluate CDStore via
different testbeds and datasets from both performance
and cost perspectives. We demonstrate that deduplication
enables CDStore to achieve cost savings. The
source code of our CDStore prototype is available at
http://ansrlab.cse.cuhk.edu.hk/software/cdstore.